8 research outputs found

    End-to-end Recurrent Denoising Autoencoder Embeddings for Speaker Identification

    Speech 'in-the-wild' is a challenge for speaker recognition systems due to the variability induced by real-life conditions, such as environmental noise and the emotional state of the speaker. Taking advantage of representation learning, in this paper we design a recurrent denoising autoencoder that extracts robust speaker embeddings from noisy spectrograms to perform speaker identification. The proposed end-to-end architecture uses a feedback loop to encode speaker information into the low-dimensional representations extracted by a spectrogram denoising autoencoder. We employ data augmentation by additively corrupting clean speech with real-life environmental noise, and we make use of a database of real stressed speech. We show that joint optimization of the denoiser and the speaker identification module outperforms both independent optimization of the two modules and hand-crafted features, under stress and noise distortions.

    Comment: 8 pages + 2 pages of references + 5 pages of images. Submitted on Monday 20th of July to Elsevier Signal Processing as a Short Communication
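    A minimal PyTorch sketch of the joint idea, not the paper's exact architecture: a recurrent autoencoder denoises spectrograms while its bottleneck also feeds a speaker classifier, and both losses are optimized together. Layer sizes and class names are illustrative assumptions, and the paper's feedback loop between the two modules is not reproduced here.

```python
# Hedged sketch only: recurrent denoising autoencoder with a joint speaker head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDenoiserSpeakerID(nn.Module):
    def __init__(self, n_mels=80, emb_dim=128, n_speakers=50):
        super().__init__()
        self.encoder = nn.GRU(n_mels, emb_dim, batch_first=True)   # frames -> hidden states
        self.decoder = nn.GRU(emb_dim, n_mels, batch_first=True)   # hidden states -> denoised frames
        self.classifier = nn.Linear(emb_dim, n_speakers)           # speaker head on the embedding

    def forward(self, noisy_spec):                 # noisy_spec: (batch, frames, n_mels)
        h, _ = self.encoder(noisy_spec)
        denoised, _ = self.decoder(h)
        embedding = h.mean(dim=1)                  # utterance-level speaker embedding
        return denoised, self.classifier(embedding)

model = JointDenoiserSpeakerID()
noisy = torch.randn(4, 200, 80)                    # toy batch of noisy spectrograms
clean = torch.randn(4, 200, 80)                    # matching clean targets
speaker = torch.randint(0, 50, (4,))               # toy speaker labels

denoised, logits = model(noisy)
# Joint optimization: denoising (MSE) and speaker identification (cross-entropy)
# are trained together, the configuration the abstract reports as best.
loss = F.mse_loss(denoised, clean) + F.cross_entropy(logits, speaker)
loss.backward()
```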

    Speaker recognition under stress conditions

    Proceeding of: IberSPEECH 2018, 21-23 November 2018, Barcelona, Spain

    Speaker recognition systems exhibit a decrease in performance when the input speech is not recorded in optimal circumstances, for example when the user is under emotional or stress conditions. The objective of this paper is to measure the effects of stress on speech in order to ultimately mitigate its consequences on a speaker recognition task. In this paper, we develop a stress-robust speaker identification system using data selection and augmentation by means of the manipulation of the original speech utterances. Extensive experimentation has been carried out to assess the effectiveness of the proposed techniques. First, we conclude that the best performance is always obtained when naturally stressed samples are included in the training set; second, when these are not available, substituting and augmenting them with synthetically generated stress-like samples improves the performance of the system.

    This work is partially supported by the Spanish Government-MinECo projects TEC2014-53390-P and TEC2017-84395-P.
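    The abstract does not spell out the manipulations used, so the following is only a hedged sketch of generating stress-like samples from clean utterances; raising pitch and speaking rate is one common proxy for stressed speech, and the file names are placeholders.

```python
# Hedged sketch: "stress-like" augmentation of a clean utterance.
import librosa
import soundfile as sf

y, sr = librosa.load("neutral_utterance.wav", sr=None)

# Stressed speech tends to show higher pitch and a faster speaking rate.
y_stress = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)  # +2 semitones
y_stress = librosa.effects.time_stretch(y_stress, rate=1.1)    # ~10% faster

sf.write("stress_like_utterance.wav", y_stress, sr)
```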

    Ecology & computer audition: applications of audio technology to monitor organisms and environment

    Among the 17 Sustainable Development Goals (SDGs) proposed within the 2030 Agenda and adopted by all the United Nations member states, the 13th SDG is a call for action to combat climate change. Moreover, SDGs 14 and 15 call for the protection and conservation of life below water and life on land, respectively. In this work, we provide a literature-founded overview of application areas in which computer audition – a powerful technology combining audio signal processing and machine intelligence that has so far hardly been considered in this context – is employed to monitor our ecosystem, with the potential to identify ecologically critical processes or states. We distinguish between applications related to organisms, such as species richness analysis and plant health monitoring, and applications related to the environment, such as melting ice monitoring or wildfire detection. This work positions computer audition in relation to alternative approaches by discussing methodological strengths and limitations, as well as ethical aspects. We conclude with an urgent call to action to the research community for a greater involvement of audio intelligence methodology in future ecosystem monitoring approaches.

    Aprendizaje automático de modelos de atención visual en el ámbito gastronómico (Machine learning of visual attention models in the gastronomic domain)

    This Bachelor's Thesis ("Trabajo Fin de Grado") aims to develop an SVM model that predicts human visual attention in gastronomic videos. The model is trained on features and human visual fixations captured from culinary videos. To assess its performance, its accuracy is evaluated against visual attention data recorded from healthy subjects while they watch those gastronomic videos. As the main objective is to find the best solution to our problem, several phases have been followed: a database of visual attention data on the videos has been created; the features that best fit the task of interest, cooking recipes, have been modelled; and a classifier appropriate for this problem, a support vector machine, has been designed and chosen. Finally, the model has been evaluated by comparing it to reference algorithms such as Itti & Koch's and Harel & Koch's.

    Ingeniería de Sistemas Audiovisuales
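    As a rough illustration of the classifier stage only, a scikit-learn SVM could be trained on per-patch visual features labelled by whether human fixations landed on the patch; feature extraction and the actual eye-tracking data are omitted, and all shapes below are invented.

```python
# Hedged sketch: SVM predicting fixation probability from per-patch features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))      # stand-in for 12 visual features per patch
y = rng.integers(0, 2, size=1000)    # 1 = fixated, 0 = not fixated

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# Per-patch fixation probabilities can be reassembled into a saliency map and
# compared against baselines such as Itti & Koch or Harel & Koch (GBVS).
saliency_scores = svm.predict_proba(X_test)[:, 1]
```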

    Multimodal Affective Computing in Wearable Devices with Applications in the Detection of Gender-based Violence

    International Mention in the doctoral degree ("Mención Internacional en el título de doctor")

According to the World Health Organization (WHO), 1 out of every 3 women suffers physical or sexual violence at some point in her life, reflecting the impact of Gender-based Violence (GBV) in the world. In Spain in particular, more than 1,100 women were murdered between 2003 and 2022, victims of gender-based violence. There is an urgent need for solutions to this prevailing problem in our society, involving appropriate investment in technological research alongside legislative, educational and economic efforts. An Artificial Intelligence (AI) driven solution that performed a comprehensive analysis of aspects such as the person's emotional state, together with an analysis of the context or external situation (e.g. circumstances, location), and could therefore automatically detect when a woman's safety is in danger, could provide an automatic and fast response to ensure women's safety.

Thus, this PhD thesis stems from the need to detect gender-based violence risk situations for women, addressing the problem from a multidisciplinary point of view by bringing together AI technologies and a gender perspective. More specifically, we focus on the auditory modality, analysing speech data produced by the user, given that voice can be recorded unobtrusively and can be used both as a personal identifier and as an indicator of the affective states reflected in it. The immediate response of a human being in a situation of risk or danger is the fight-flight-freeze response. Several physiological changes affect the body: breathing, heart rate, and muscle activation, including the complex speech production apparatus, thereby affecting vocalisation characteristics. Because of all these physical and physiological changes and their involuntary nature as a response to a situation of risk, we rely on physiological signals such as pulse, perspiration and respiration, as well as on speech, in order to detect the emotional state of a person, with the intention of recognising fear, which could be a consequence of being in a threatening situation.

To this end, we developed "Bindi": an end-to-end, AI-driven, inconspicuous, connected, edge-computation-based wearable solution targeting the automatic detection of GBV situations. It consists of two smart devices that monitor the physiological variables and the acoustic environment, including the voice, of an individual, connected to a smartphone and a cloud system able to call for help.

Ideally, in order to build a Machine Learning or Deep Learning system for the automatic detection of risk situations from auditory data, we would like to count on speech recorded under realistic conditions and belonging to the target user. In our first steps, we encountered the lack of suitable data: no speech datasets of real (not acted) fear were available in the literature. Real, original, spontaneous, in-the-wild emotional speech is the ideal category we needed for our application. We therefore chose stress as the emotion closest to the target scenario for data collection, in order to flesh out the algorithms and acquire the necessary knowledge. We describe and justify the use of datasets containing this emotion as the starting point of our investigation, and we describe the need to create our own datasets to fill this gap in the literature.
Then, members of our UC3M4Safety team captured the UC3M4Safety Audiovisual Stimuli Database, a dataset of 42 audiovisual stimuli to elicit emotions. Using them, we contributed to the community with the collection of WEMAC, a multi-modal dataset comprising a laboratory-based experiment in which women volunteers were exposed to the UC3M4Safety Audiovisual Stimuli Database. It aims to induce real emotions using a virtual reality headset while the user's physiological signals, speech signals and self-reports are collected. But recording realistic, spontaneous emotional speech in fearful conditions is very difficult, if not impossible. To get as close as possible to these conditions, and hopefully record fearful speech, the UC3M4Safety team created the WE-LIVE database, with which we collected physiological, auditory and contextual signals from women in real-life conditions, together with the labelling of their emotional reactions to everyday events in their lives, using the current Bindi system (wristband, pendant, mobile application and server).

In order to detect GBV risk situations through speech, we first need to detect the voice of the specific user we are interested in, a speaker recognition task, among all the information contained in the audio signal. We thus aim to track the user's voice, separating it from the rest of the speakers in the acoustic scene, while trying to avoid the influence of emotions or ambient noise on the identification of the speaker, as these factors can be detrimental to it. We study speaker recognition systems under two variability conditions: 1) speaker identification under stress conditions, to see how much stress affects speaker recognition systems, and 2) speaker recognition under real-life noisy conditions, isolating the speaker's identity among all the additional information contained in the audio signal.

We also dive into the development of the Bindi system for the recognition of fear-related emotions. We describe the architectures of Bindi versions 1.0 and 2.0, the evolution from one to the other, and their implementation. We explain the approach followed for the design of a cascade multimodal system for Bindi 1.0, and the design of a complete Internet of Things system with edge, fog and cloud computing components for Bindi 2.0, specifically detailing how we designed the intelligence architectures in the Bindi devices for fear detection in the user. We then perform monomodal inference, first targeting the detection of realistic stress through speech. Later, as core experimentation, we work with WEMAC on the task of fear detection using data fusion strategies. The experimental results show an average fear recognition accuracy of 63.61% with the Leave-hAlf-Subject-Out (LASO) method, a speaker-adapted, subject-dependent training classification strategy. To the best of the UC3M4Safety team's knowledge, this is the first time that a multimodal fusion of physiological and speech data for fear recognition has been presented in this GBV context, and the first time a LASO model combining fear recognition, multisensorial signal fusion and virtual reality stimuli has been presented. We also explored how the condition of being a victim of gender-based violence could be detected from speech paralinguistic cues alone.
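The LASO partitioning mentioned above can be sketched as follows; this is an assumed reading of the text (half of each subject's samples in training, half held out), not the thesis code, and all names are illustrative.

```python
# Hedged sketch of a Leave-hAlf-Subject-Out (LASO) split.
import numpy as np

def laso_split(subject_ids, seed=0):
    """Return train/test indices with a per-subject 50/50 split."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for subject in np.unique(subject_ids):
        idx = np.flatnonzero(subject_ids == subject)
        rng.shuffle(idx)
        half = len(idx) // 2
        train_idx.extend(idx[:half])   # first half: training (subject-adapted)
        test_idx.extend(idx[half:])    # second half: evaluation
    return np.array(train_idx), np.array(test_idx)

# Toy usage: 3 subjects with 4 samples each.
subjects = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
train_idx, test_idx = laso_split(subjects)
```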
Overall, this thesis explores the use of audio technology and artificial intelligence to prevent and combat gender-based violence. We hope that we have lit the way for it in the speech community and beyond, and that our experimentation, findings and conclusions can help future research. The ultimate goal of this work is to ignite the community's interest in developing solutions to the very challenging problem of GBV.

I would like to thank the following institutions for the financial support received to complete this thesis:
• Department of Research and Innovation of the Madrid Regional Authority, for the EMPATIA-CM research project (reference Y2018/TCS-5046).
• Spanish State Research Agency, for the SAPIENTAE4Bindi project (reference PDC2021-121071-I00) funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR".
• Community of Madrid YEI Predoctoral Program, for the Predoctoral Research Personnel in Training (PEJD-2019-PRE/TIC-16295) scholarship.
• Spanish Ministry of Universities, for the University Teacher in Training ["Formación de Personal Universitario (FPU)"] grant FPU19/00448 and the Supplementary Short-stay Mobility 2020 Grant for beneficiaries of the University Teacher in Training program ["Ayudas complementarias de movilidad Estancias Breves 2020 destinadas a beneficiarios del programa de Formación del Profesorado Universitario (FPU)"].
• German Academic Exchange Service (DAAD), for the Short Term Grant Scholarship 2020.

Doctoral Programme in Multimedia and Communications of Universidad Carlos III de Madrid and Universidad Rey Juan Carlos

Chair: Elisabeth André. Secretary: Elena Romero Perales. Committee member: Emilia María Parada Cabaleiro

    Bindi: Affective internet of things to combat gender-based violence

    The main research motivation of this article is the fight against gender-based violence and the achievement of gender equality from a technological perspective. The solution proposed in this work goes beyond currently existing panic buttons, which need to be manually operated by the victims under difficult circumstances. Instead, Bindi, our end-to-end autonomous multimodal system, relies on artificial intelligence methods to automatically identify violent situations, based on detecting fear-related emotions, and to trigger a protection protocol if necessary. To this end, Bindi integrates modern state-of-the-art technologies, such as the Internet of Bodies, affective computing, and cyber-physical systems, leveraging: 1) affective Internet of Things (IoT), with auditory and physiological commercial off-the-shelf smart sensors embedded in wearable devices; 2) hierarchical multisensorial information fusion; and 3) the edge-fog-cloud IoT architecture. This solution is evaluated using our own dataset, WEMAC, a recently collected and freely available collection of data comprising the auditory and physiological responses of 47 women to several emotions elicited using a virtual reality environment. On this basis, this work provides an analysis of multimodal late fusion strategies for combining the physiological and speech data processing pipelines, in order to identify the best intelligence engine strategy for Bindi. In particular, the best data fusion strategy reports an overall fear classification accuracy of 63.61% for a subject-independent approach. A power consumption study and an audio data processing pipeline to detect violent acoustic events complement this analysis. This research is intended as an initial multimodal baseline that facilitates further work with real-life elicited fear in women.

    This work was supported in part by the Department of Research and Innovation of the Madrid Regional Authority through the EMPATIA-CM Research Project (Reference Y2018/TCS-5046); in part by MCIN/AEI/10.13039/501100011033 under Grant PDC2021-121071-I00 and the European Union NextGenerationEU/PRTR; in part by the Spanish Ministry of Universities with the FPU Grant FPU19/00448; and in part by the Madrid Government (Comunidad de Madrid-Spain) through the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M26), in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).
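    A hedged sketch of one possible late-fusion strategy of the kind analysed here: each monomodal pipeline emits a per-window fear probability, and the fusion stage combines them with a weighted average. The weights, threshold and toy numbers are illustrative, not the values selected in the article.

```python
# Hedged sketch: weighted-average late fusion of two monomodal classifiers.
import numpy as np

def late_fusion(p_physio, p_speech, w_physio=0.5, threshold=0.5):
    """Weighted average of fear probabilities from both modalities."""
    p_fused = w_physio * p_physio + (1.0 - w_physio) * p_speech
    return p_fused, (p_fused >= threshold).astype(int)

# Toy per-window outputs of the physiological and speech pipelines.
p_physio = np.array([0.7, 0.4, 0.9])
p_speech = np.array([0.6, 0.3, 0.5])
probs, fear_detected = late_fusion(p_physio, p_speech, w_physio=0.6)
```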

    WEMAC: Women and Emotion Multi-modal Affective Computing dataset

    Among the seventeen Sustainable Development Goals (SDGs) proposed within the 2030 Agenda and adopted by all the United Nations member states, the fifth SDG is a call for action to turn gender equality into a fundamental human right and an essential foundation for a better world. It includes the eradication of all types of violence against women. Within this context, the UC3M4Safety research team aims to develop Bindi: a cyber-physical system including embedded Artificial Intelligence algorithms for real-time user monitoring towards the detection of affective states, with the ultimate goal of achieving the early detection of risk situations for women. On this basis, we make use of wearable affective computing, including smart sensors, data encryption for the secure and accurate collection of presumed crime evidence, and remote connection to protection agents. Towards the development of such a system, the recording of different laboratory and in-the-wild datasets is in progress; these are contained within the UC3M4Safety Database. Thus, this paper presents and details the first release of WEMAC, a novel multi-modal dataset comprising a laboratory-based experiment in which 47 women volunteers were exposed to validated audio-visual stimuli to induce real emotions using a virtual reality headset, while physiological signals, speech signals and self-reports were acquired. We believe this dataset will serve and assist research on multi-modal affective computing using physiological and speech information.

    UC3M4Safety Database description

    EMPATIA-CM (Comprehensive Protection of Gender-based Violence Victims through Multimodal Affective Computing) is a research project that aims to understand the reactions of gender-based violence victims to situations of danger, to generate mechanisms for the automatic detection of these situations, and to study how to react in a comprehensive, coordinated and effective way so as to protect the victims as well as possible. The project is divided into five objectives that demonstrate the need for, and the added value of, the multidisciplinary approach:
    * Understand the reaction mechanisms of gender-based violence victims in risky situations.
    * Investigate, design and verify algorithms to automatically detect risk situations for gender-based violence victims.
    * Design and implement the automatic detection system for risk situations for gender-based violence victims.
    * Investigate a new protocol to protect gender-based violence victims with a holistic approach.
    * Use the data collected by the system for detecting hazardous situations for gender-based violence victims.

    This project is funded by the Comunidad de Madrid, Consejería de Ciencia, Universidades e Innovación, under the programme of synergistic R&D projects in new and emerging scientific areas at the frontier of science and of an interdisciplinary nature ("Programa de proyectos sinérgicos de I+D en nuevas y emergentes áreas científicas en la frontera de la ciencia y de naturaleza interdisciplinar"), co-financed by the Operational Programmes of the European Social Fund and the European Regional Development Fund, 2014-2020, of the Comunidad de Madrid (EMPATÍA-CM, Ref: Y2018/TCS-5046).